
Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Abstract

Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures: Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination ($MD$) and $MD-\chi^2$ methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
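The abstract describes ranking features by the divergence between their class-conditional distributions. As a rough illustration only (the paper's exact estimators, event model, and the $MD$/$MD-\chi^2$ procedures are not reproduced here), the sketch below ranks features by Jeffreys divergence in the two-class case, assuming a Bernoulli term presence/absence model; the function names and the smoothing constant `eps` are this sketch's own.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence KL(p || q) between two Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def jeffreys_bernoulli(p, q):
    """Jeffreys divergence: symmetrized KL, J(p, q) = KL(p||q) + KL(q||p)."""
    return kl_bernoulli(p, q) + kl_bernoulli(q, p)

def rank_features(X, y, eps=1e-6):
    """Rank binary term features by Jeffreys divergence between the
    class-conditional term-occurrence probabilities (two-class case).

    X : (n_docs, n_features) 0/1 term-occurrence matrix
    y : (n_docs,) labels in {0, 1}
    Returns feature indices sorted from most to least discriminative.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Smoothed estimates of P(term present | class), avoiding log(0)
    p1 = (X[y == 1].sum(axis=0) + eps) / (np.sum(y == 1) + 2 * eps)
    p0 = (X[y == 0].sum(axis=0) + eps) / (np.sum(y == 0) + 2 * eps)
    scores = jeffreys_bernoulli(p1, p0)
    return np.argsort(scores)[::-1]

# Toy usage: 4 documents, 3 terms; term 0 separates the classes perfectly.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])
print(rank_features(X, y))  # term 0 should rank first
```

Jeffreys divergence is a natural scoring choice here because, as the symmetrized KL divergence, it accounts for both directions of discrimination, mirroring the type I and type II error analysis the abstract refers to.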
